A Lexical Database of Portuguese Multiword Expressions
Abstract
This presentation describes an ongoing project to create a large lexical database of Portuguese multiword (MW) units. The units are automatically extracted from a balanced 50-million-word corpus, statistically interpreted with lexical association measures, and validated by hand. The database covers different types of MW units, such as named entities, and lexical associations ranging from sets of favoured co-occurring forms to strongly lexicalized expressions. The new resource has a twofold objective: to serve as an important research tool that supports the development of typologies of MW units, and to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
Similar Resources
COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese Multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of multiword (MW) units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, interpreted with lexical association measures and manually validated. MW expressions conside...
Multilingual Aspects of Multiword Lexical Units
As most of the machine-readable dictionaries contain clearly insufficient information about multiword lexical units, there is a constant need to extend and tune specialized lexical databases to account for new expressions. In this paper, we present a system exclusively based on statistics that massively extracts from unrestricted text corpora contiguous and noncontiguous rigid multiword lexical...
Improving Lexical Databases with Collocational Information: Data from Portuguese
This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can...
Multiword Lexical Acquisition And Dictionary Formalization
In this paper, we present the current state of development of a large-scale lexicon built at LabEL for Portuguese. We will concentrate on multiword expressions (MWE), particularly on multiword nouns, (i) illustrating their most relevant morphological features, and (ii) pointing out the methods and techniques adopted to generate the inflected forms from lemmas. Moreover, we describe a corpus-ba...
Representation And Treatment Of Multiword Expressions In Basque
This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After discussing and stating which kinds of multiword expressions we consider to be processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a general-purpose lexical database. Due to its expressive power, th...